Human similar action recognition by fusing saliency image semantic features
Authors
Abstract
Objective: Skeleton-based action recognition has become a research hotspot because of its stronger robustness under illumination changes, dynamic viewpoints, and complex backgrounds. When skeleton/joint data are used to recognize similar human actions, the small differences in joint features between actions and the lack of other image semantic information easily lead to recognition confusion. To address this problem, a saliency image feature enhancement based center-connected graph convolutional network (SIFE-CGCN) model is proposed.

Method: First, a skeleton center-connection topology is designed that establishes connections from all joints to the skeleton center, so as to capture the subtle differences in joint movement between similar actions. Second, the Gaussian mixture background modeling algorithm compares each frame with a continuously updated background model, segments the dynamic image regions, and removes background interference to obtain saliency images; feature maps are then extracted with a pre-trained VGG-Net (Visual Geometry Group network) and used for action semantic feature matching and classification. Finally, a fusion algorithm is designed that uses the matching result to reinforce and correct the recognition result of the center-connected graph convolutional network, improving the ability to recognize similar actions. In addition, a skeleton-based method for computing action similarity is proposed, and a similar-action dataset is established.

Result: Experiments compare the proposed method with other methods on the similar-action dataset and the NTU RGB+D 60/120 (Nanyang Technological University 60/120) datasets. On the similar-action dataset, the recognition accuracy is 4.6% and 6.0% higher than that of the suboptimal model on the cross-subject (X-Sub) and cross-view (X-View) benchmarks, respectively. On NTU RGB+D 60, the accuracy is 1.4% and 0.6% higher than that of the suboptimal model on the X-Sub and X-View benchmarks, respectively, and on NTU RGB+D 120 it is 1.7% and 1.1% higher on the X-Sub and cross-setup (X-Set) benchmarks. Moreover, multiple comparison experiments verify the effectiveness of the center-connected graph convolutional network, the saliency image extraction method, and the fusion algorithm.

Conclusion: The proposed method recognizes and classifies similar actions accurately and effectively, and the overall recognition performance and robustness of the model are also improved.

Objective: Human action recognition is a valuable research area in computer vision. It has a wide range of applications, such as security monitoring, intelligent human-computer interaction, and virtual reality. Skeleton-based methods first extract the position coordinates of the major body joints from video, by hardware or by software, and then use the skeleton information for recognition. In recent years, they have received increasing attention because of their robustness to dynamic environments, complex backgrounds, and occlusion. Early methods usually rely on hand-crafted features for modeling, but such features generalize poorly and lack diversity. Deep learning has become mainstream thanks to its powerful automatic feature-extraction capabilities. Traditional deep models arrange the skeleton data as joint coordinate vectors or pseudo-images, which are directly fed into recurrent neural networks (RNNs) or convolutional neural networks (CNNs) for classification. However, RNN-based and CNN-based methods lose spatial structure information because they are restricted to Euclidean data structures and cannot exploit the natural correlations between human joints, which makes distinguishing the subtle differences between similar actions difficult. The human skeleton is naturally structured as a graph in non-Euclidean space, and several works have successfully adopted graph convolutional networks (GCNs) to achieve state-of-the-art performance in skeleton-based recognition, although some joint dependencies that are crucial to recognizing similar actions are not explicitly learned. Moreover, skeleton data shield the object that interacts with the human and only retain the primary joint coordinates; the loss of image semantics and the reliance on skeleton sequences alone make recognizing similar actions remarkably challenging.

Method: Given the above factors, a saliency image feature enhancement based center-connected graph convolutional network (SIFE-CGCN) is proposed in this work for similar action recognition. The model is built on a GCN and can fully utilize the temporal dependence between joints. First, a center-connected graph convolutional network (CGCN) is constructed for recognition. In the spatial dimension, a center-connection topology is designed that establishes connections from all joints to the skeleton center to capture the small differences in joint movement between similar actions. In the temporal dimension, each frame is associated with the previous and subsequent frames of the sequence; therefore, the number of adjacent nodes is fixed at 2, and a regular 1D convolution serves as the temporal convolution. A basic unit consists of a spatial graph convolution, a temporal convolution, and a dropout layer; for training stability, a residual connection is added to each unit, and the network is formed by stacking nine such units. A batch normalization (BN) layer at the beginning standardizes the input data, and a global average pooling layer at the end unifies the feature dimensions. A dual-stream architecture uses joint and bone data simultaneously to describe actions from multiple angles. Given the different roles that image regions play in an action, the saliency image should focus on the main motion of the action. Second, the Gaussian mixture background modeling method is selected to extract saliency images: each frame is compared with a continuously updated background model to segment the regions with considerable change, and background interference is eliminated. The resulting saliency images preserve the effective semantic information of the key action regions. The Visual Geometry Group network (VGG-Net) can effectively recognize objects in images; in this work, feature maps are extracted through a pre-trained VGG-Net and used for action semantic feature matching. Finally, a fusion algorithm uses the matching result to reinforce and correct the recognition result of the CGCN and improve the ability to recognize similar actions. In addition, an action similarity calculation method is proposed and a similar-action dataset is established in this work.

Result: The model is compared with state-of-the-art models on the similar-action dataset and the Nanyang Technological University RGB+D (NTU RGB+D) 60/120 datasets; the compared methods include CNN-based, RNN-based, and GCN-based models. On the cross-subject (X-Sub) and cross-view (X-View) benchmarks of the similar-action dataset, the accuracy reaches 80.3% and 92.1%, which is 4.6% and 6.0% higher than the suboptimal algorithm, respectively. On the X-Sub and X-View benchmarks of NTU RGB+D 60, the accuracy reaches 91.7% and 96.9%, improvements of 1.4% and 0.6% over the suboptimal algorithm. Compared with the feedback graph convolutional network (FGCN), the accuracy on NTU RGB+D 120 improves by 1.7% and 1.1% on the X-Sub and cross-setup (X-Set) benchmarks, respectively. In addition, a series of comparative experiments clearly shows the effectiveness of the CGCN, the saliency image extraction method, and the fusion algorithm.

Conclusion: In this study, SIFE-CGCN is proposed to resolve the confusion that arises when recognizing similar actions from ambiguous skeleton information. The experimental results show that the proposed method recognizes similar actions accurately and effectively, and the overall recognition performance is improved.
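The center-connection topology can be illustrated with a small adjacency-matrix sketch. The joint count, bone list, and choice of the spine joint as the skeleton center below are illustrative assumptions (loosely following a 25-joint NTU RGB+D-style layout, with the bone list abridged), not the paper's exact graph definition.

```python
import numpy as np

# Illustrative 25-joint skeleton; joint 20 (spine) is assumed to be the
# skeleton "center" node. Bone list is abridged for the sketch.
NUM_JOINTS = 25
CENTER = 20
BONES = [(0, 1), (1, 20), (20, 2), (2, 3),          # trunk and head
         (20, 4), (4, 5), (5, 6), (6, 7),           # left arm
         (20, 8), (8, 9), (9, 10), (10, 11),        # right arm
         (0, 12), (12, 13), (13, 14), (14, 15),     # left leg
         (0, 16), (16, 17), (17, 18), (18, 19)]     # right leg

def center_connected_adjacency(num_joints=NUM_JOINTS, bones=BONES, center=CENTER):
    """Build a normalized adjacency matrix with extra joint-to-center links."""
    A = np.eye(num_joints)                       # self-loops
    for i, j in bones:                           # natural skeleton bones
        A[i, j] = A[j, i] = 1.0
    for j in range(num_joints):                  # connect every joint to the center
        A[j, center] = A[center, j] = 1.0
    D = np.diag(1.0 / np.sqrt(A.sum(axis=1)))    # symmetric normalization
    return D @ A @ D

A_hat = center_connected_adjacency()
print(A_hat.shape)                               # (25, 25)
```

The extra joint-to-center links are what distinguish this graph from the plain bone topology: every joint gains a two-hop path to every other joint through the center, which is the property the abstract relies on for capturing small cross-joint differences.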
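A rough PyTorch sketch of one CGCN basic unit as the abstract describes it (spatial graph convolution, 1D temporal convolution, dropout, and a residual connection, stacked nine times with global average pooling at the end) is given below. Channel sizes, kernel size, dropout rate, and the placement of batch normalization inside the unit are assumptions, not the paper's published configuration.

```python
import torch
import torch.nn as nn

class CGCNUnit(nn.Module):
    """Hypothetical basic unit: spatial graph conv + temporal conv + dropout + residual."""
    def __init__(self, in_ch, out_ch, A_hat, t_kernel=9, dropout=0.5):
        super().__init__()
        self.register_buffer("A_hat", A_hat)               # (V, V) normalized adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.drop = nn.Dropout(dropout)
        self.bn = nn.BatchNorm2d(out_ch)
        self.residual = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.relu = nn.ReLU()

    def forward(self, x):                                   # x: (N, C, T, V)
        res = self.residual(x)
        y = torch.einsum("nctv,vw->nctw", self.spatial(x), self.A_hat)  # graph conv
        y = self.drop(self.bn(self.temporal(y)))            # temporal conv over frames
        return self.relu(y + res)

# Nine stacked units with global average pooling, as the abstract describes.
A_hat = torch.eye(25)                                       # placeholder adjacency
units = nn.Sequential(*[CGCNUnit(64 if i else 3, 64, A_hat) for i in range(9)])
x = torch.randn(2, 3, 300, 25)                              # (batch, coords, frames, joints)
out = units(x).mean(dim=[2, 3])                             # global average pooling -> (2, 64)
```

The abstract's input BN layer and the dual-stream joint/bone setup are omitted here for brevity.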
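The saliency branch (Gaussian mixture background modeling followed by VGG feature extraction) might be prototyped roughly as follows. OpenCV's MOG2 subtractor stands in for the paper's background model, a pre-trained VGG-16 from torchvision stands in for "VGG-Net", and the video file name, thresholds, and layer choice are assumptions.

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# Gaussian-mixture background subtractor (stand-in for the paper's model).
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=False)
# Pre-trained VGG-16 convolutional backbone for semantic feature maps.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
preprocess = T.Compose([T.ToTensor(),
                        T.Resize((224, 224)),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def saliency_feature(frame_bgr):
    """Mask out the static background of one frame and return VGG feature maps."""
    mask = subtractor.apply(frame_bgr)                    # foreground mask (0/255)
    saliency = cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)
    rgb = cv2.cvtColor(saliency, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        return vgg(preprocess(rgb).unsqueeze(0))          # e.g. (1, 512, 7, 7)

cap = cv2.VideoCapture("action_clip.mp4")                 # hypothetical input video
features = []
ok, frame = cap.read()
while ok:
    features.append(saliency_feature(frame))
    ok, frame = cap.read()
cap.release()
```

The extracted feature maps would then be matched against per-class action semantic features; that matching step is not shown here.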
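Finally, the fusion step that lets the saliency-matching result reinforce and correct the CGCN output could take a score-level form like the sketch below. The confidence margin, weight alpha, and top-k re-ranking rule are hypothetical and not the paper's published algorithm.

```python
import numpy as np

def fuse_scores(gcn_probs, vgg_probs, alpha=0.7, top_k=2, margin=0.2):
    """Hypothetical fusion rule: if the GCN's top predictions are close
    (an ambiguous, 'similar' action), let the saliency-image matching
    score re-rank the top-k candidates; otherwise keep the GCN result."""
    order = np.argsort(gcn_probs)[::-1]                   # classes by GCN confidence
    if gcn_probs[order[0]] - gcn_probs[order[1]] > margin:
        return order[0]                                   # GCN is confident: keep it
    candidates = order[:top_k]                            # ambiguous: fuse on top-k
    fused = alpha * gcn_probs[candidates] + (1 - alpha) * vgg_probs[candidates]
    return candidates[int(np.argmax(fused))]

gcn_probs = np.array([0.02, 0.46, 0.44, 0.08])            # toy class probabilities
vgg_probs = np.array([0.05, 0.15, 0.70, 0.10])
print(fuse_scores(gcn_probs, vgg_probs))                  # -> 2
```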
Similar resources
A Saliency Detection Model via Fusing Extracted Low-level and High-level Features from an Image
Saliency regions attract more human attention than other regions in an image. Low-level and high-level features are utilized in saliency region detection. Low-level features contain primitive information such as color or texture, while high-level features usually consider visual systems. Recently, some salient region detection methods have been proposed based on only low-level features or hig...
Single Image Action Recognition by Predicting Space-Time Saliency
We propose a novel approach based on deep Convolutional Neural Networks (CNN) to recognize human actions in still images by predicting the future motion, and detecting the shape and location of the salient parts of the image. We make the following major contributions to this important area of research: (i) We use the predicted future motion in the static image (Walker et al., 2015) as a means o...
Human Action Recognition by Conceptual Features
Human action recognition is the process of labeling a video according to human behavior. This process requires a large set of labeled videos and analyzing all the frames of a video. The consequence is high computation and memory requirements. This paper solves these problems by focusing on a limited set rather than all human actions and by considering the human-object interaction. This paper emplo...
Action recognition by saliency-based dense sampling
Action recognition, aiming to automatically classify actions from a series of observations, has attracted increasing attention in the computer vision community. The state-of-the-art action recognition methods utilize densely sampled trajectories to build feature representations. However, their performance is limited due to action region clutter and camera motion in real-world applications. No matte...
Action Recognition with Image Based CNN Features
Most human actions consist of complex temporal compositions of simpler actions. Action recognition tasks usually rely on complex handcrafted structures as features to represent the human action model. Convolutional Neural Nets (CNN) have been shown to be a powerful tool that eliminates the need for designing handcrafted features. Usually, the output of the last layer in a CNN (a layer before t...
Journal
Journal title: Journal of Image and Graphics
Year: 2023
ISSN: 1006-8961
DOI: https://doi.org/10.11834/jig.220028